An HTTP log file records every request that a server accepts,
including information such as the date and time of the request,
the client submitting it, the file requested, and so on. 1 This neat
source of information establishes a basis for observations that can lead to
improved overall performance for a given Web site. One of these
observations, page popularity, is the subject of this article.

We examine how to define page popularity in a meaningful way and 
show how using page popularity to rearrange a Web site can lead to a sub- 
stantially more accessible and more effective hypertext scheme. (In a sim- 
ilar vein, Golovchinsky has explored automatic methods of link construc- 
tion based on feedback from users collected during browsing. 2 ) 

The notion of log file analysis is not new (see the sidebar, "Web 
Resources for Improving Navigation" on page 24), but our approach uses 
an analysis of relative page popularity to automatically reorganize the pages 
of a Web site. We have developed pilot software to perform these func- 
tions and show here that the resulting improvement in navigation boost- 
ed page accesses in five different Web sites. 

DEFINING PAGE POPULARITY 

One easy way to estimate a page's popularity is to count the accesses to this
page based exclusively on a given log file. (See
http://www.w3.org/Daemon/User/Config/Logging.html for all the HTTP connection
properties written to the log file.) However, counting these absolute accesses
(AAs) from the log file can be misleading. A page close to the home page (the
server's initial page) will probably have more absolute accesses, because it
stands on a path between the home page and target pages located deeper in the
HTML "tree", where the home page is the root and every hyperlink to another
page constitutes a parent-child relationship. (Because a child page might have
several links toward one of its ancestors, the term "tree" is not wholly
felicitous.)
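
As an illustration of counting absolute accesses, the following minimal Python sketch tallies requests per page from log lines. It assumes the log uses the Common Log Format and that only successful (2xx) responses should count as accesses; both are assumptions made for this example, and the function name and sample lines are hypothetical.

```python
import re
from collections import Counter

# Matches the request and status fields of a Common Log Format line, e.g.
# 192.0.2.1 - alice [10/Oct/1999:13:55:36 +0200] "GET /page2.html HTTP/1.0" 200 2326
REQUEST_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)[^"]*" (\d{3}) ')


def count_absolute_accesses(log_lines):
    """Return a Counter mapping each requested path to its number of accesses."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        # Assumption: only 2xx responses are treated as real page accesses.
        if match and match.group(2).startswith("2"):
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines, invented for the example.
sample_log = [
    '192.0.2.1 - alice [10/Oct/1999:13:55:36 +0200] "GET /page2.html HTTP/1.0" 200 2326',
    '192.0.2.7 - bob [10/Oct/1999:13:56:02 +0200] "GET /page2.html HTTP/1.0" 200 2326',
    '192.0.2.7 - bob [10/Oct/1999:13:56:40 +0200] "GET /missing.html HTTP/1.0" 404 209',
]
absolute_accesses = count_absolute_accesses(sample_log)
```

Here `absolute_accesses["/page2.html"]` is 2, while the failed request for `/missing.html` is not counted.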

To better examine a page's popularity, we could take the following
factors into consideration:

■ the depth of the page (how many steps it is from the home page), d;
■ the number of pages at the same depth as the page being examined, n; and
■ the number of references (hyperlinks) to this particular page from other
  pages of the server, r.

Let's assume there is a factor a that embraces all of
these parameters. Then, we can coin a new term,
relative accesses (RA), which we derive from the
following equation:

RA = a * AA (1)

Defining the Coefficient a 

Because page depth d can detract from the popularity
of a page, we can reasonably assume that coefficient
a must be proportional to d. Likewise, a should be
proportional to the number of pages at the same depth,
n, because the larger the number of choices, the greater
the significance of selecting a specific page. Of course,
references from other pages, r, generally bolster page
popularity, so r must be in inverse proportion to
coefficient a.

Recent Web servers can be configured to track
down Web pages that have links to any of their
pages and store the addresses in what is called the
referer log. Most Web browsers support this feature
and thus send the reference (link) to a server's page
simultaneously with the browser's HTTP request.

Let r be an estimate based on the number of references
from a server's pages to a page on the same
server. The relation between a_i, d_i, n_i, and r_i (the
values of a, d, n, and r for page i) should have the
following form:

a_i = F(d_i, n_i, 1/r_i), i = 1, ..., K (2)

where K is the number of Web pages and F is a specific
function that can be defined to reflect a user's
behavior (what pages the user visits, in what order,
for how long, and so on) during navigation to a
specific Web site.

Link-Editing Software 

We have developed pilot software, Soala Version
2.0, 3 that reorganizes the links between pages
according to criteria set by the Web server's
administrator. In the following measurements, if the
relative popularity of a given page exceeds the
popularity of at least one ancestor page (a page with
a shorter distance to the home page, standing in its
path), the software rearranges the links.



Soala can assess relative page popularity periodically
(every week, for example). The most instructive
of the metrics it calculates are

■ AA: number of absolute accesses per page;
■ RA: number of relative accesses per page;
■ PT: mean page time, how much time a user
  spends on a specific page;
■ UT: mean user time, how much time a user
  spends on the server every session; and
■ NP: mean number of pages, how many different
  pages a user visits every session.






In our analyses, a new session begins when a user first
accesses the Web site. All successive accesses belong
to the same session provided that the interval between
them does not exceed a given time threshold.
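
The session rule above can be sketched in Python as follows. This is a minimal illustration, not the Soala implementation: the tuple layout `(user, timestamp, page)`, the function names, and the 30-minute default threshold are all assumptions made for the example (the article does not state the threshold it used).

```python
from collections import defaultdict


def split_into_sessions(accesses, threshold_seconds=1800):
    """Group (user, timestamp, page) accesses into per-user sessions.

    A new session starts whenever the gap between two successive accesses
    by the same user exceeds threshold_seconds (an assumed default).
    """
    by_user = defaultdict(list)
    for user, timestamp, page in sorted(accesses, key=lambda a: (a[0], a[1])):
        by_user[user].append((timestamp, page))

    sessions = []
    for user, hits in by_user.items():
        current = [hits[0]]
        for previous, hit in zip(hits, hits[1:]):
            if hit[0] - previous[0] > threshold_seconds:
                sessions.append((user, current))
                current = []
            current.append(hit)
        sessions.append((user, current))
    return sessions


def mean_pages_per_session(sessions):
    """NP: mean number of different pages visited per session."""
    return sum(len({page for _, page in hits}) for _, hits in sessions) / len(sessions)


# Hypothetical accesses: timestamps are seconds since an arbitrary start.
accesses = [
    ("alice", 0, "/page1.html"),
    ("alice", 60, "/page2.html"),
    ("alice", 5000, "/page1.html"),  # gap of 4940 s starts a second session
    ("bob", 10, "/page1.html"),
]
sessions = split_into_sessions(accesses)
np_value = mean_pages_per_session(sessions)
```

With these sample accesses there are three sessions (two for alice, one for bob), and NP is (2 + 1 + 1) / 3.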

A simple version of our link-editing algorithm is
"If any page has an RA higher than its parent, interchange
it with its parent page. Repeat the previous
step until every page has ancestors with higher RAs."
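
The quoted rule can be sketched in Python as shown below. This is only one possible reading of it, under stated assumptions: the tree is described by fixed positions ("slots") with a parent map, the RA values are treated as fixed measurements for the duration of one run, and the names `edit_links`, `page_at`, `parent_of`, and the sample RA values are hypothetical.

```python
def edit_links(page_at, parent_of, ra):
    """Swap pages with their parents until no page has a higher RA than its parent.

    page_at:   maps each tree position (slot) to the page currently placed there
    parent_of: maps each slot to its parent slot (the root maps to None)
    ra:        maps each page to its measured relative accesses
    """
    page_at = dict(page_at)  # work on a copy; the caller's layout is untouched
    swapped = True
    while swapped:
        swapped = False
        for slot, parent in parent_of.items():
            if parent is None:
                continue
            if ra[page_at[slot]] > ra[page_at[parent]]:
                # Interchange the page with its parent page.
                page_at[slot], page_at[parent] = page_at[parent], page_at[slot]
                swapped = True
    return page_at


# A seven-slot binary tree: slot k has children 2k and 2k + 1.
parent_of = {1: None, 2: 1, 3: 1, 4: 2, 5: 2, 6: 3, 7: 3}
page_at = {slot: f"Page{slot}" for slot in parent_of}
# Made-up RA values, chosen so that Page5 is more popular than its ancestors.
ra = {"Page1": 100, "Page2": 50, "Page3": 60, "Page4": 10,
      "Page5": 200, "Page6": 20, "Page7": 30}

revised = edit_links(page_at, parent_of, ra)
```

In this example Page5 is first interchanged with Page2 and then with Page1, so it ends up at the root, while Page1 and Page2 each move one level down. Each swap moves a strictly more popular page upward, so the loop terminates for fixed RA values.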

Case Study Definition 

We applied this algorithm for a specific definition
of factor a_i to show how an awkward organization
of a Web server can discourage Internet users and
how link editing (based on relative accesses) can
remedy the problem.

To ensure that our results would be as accurate
as possible, we allowed only authenticated users
with valid usernames and passwords to access the
Web server for this case study. The authentication
of users is not crucial for the link-editing algorithm;
we introduced this requirement for evaluation
purposes only. Furthermore, there were no references
to our site from outside our server, so the value of
parameter r_i was accurate. We carefully selected the
group of users, as well as the pages, to facilitate the
interpretation of the results.

According to our earlier discussion, one possible
way of specifying coefficient a_i is

a_i = c_1 * d_i + c_2 * n_i / r_i (3)

where c_1 and c_2 are constants whose values vary
according to the structure of the Web site.

Figure 1. The link-editing algorithm reorganizes the pages in a Web site:
(a) initial HTML structure and (b) revised HTML structure.

The definition of parameters c_1 and c_2 is definitely
a challenging issue. Currently, we know that c_1 and
c_2 can significantly affect the algorithm's behavior,
but we are still researching what values will boost its
performance. For example, the higher we make the
value of c_1, the more vigorously the algorithm promotes
pages deeper in the HTML structure. In some
cases (when the value of c_1 is extremely high), this
leads to volatile page arrangements: after each execution
of the link-editing algorithm, a new
rearrangement occurs. This is not a sound case. Any
robust algorithm should achieve a balanced page
arrangement, which remains unaffected by further
executions of the algorithm, if the relative accesses
don't change significantly. If we assign the value 1 to
parameters c_1 and c_2, we can rewrite equation 3 as

a_i = d_i + n_i / r_i

This is the case that we will examine. 
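
To make the arithmetic concrete, here is a small Python sketch of equations 1 and 3 with both constants set to 1. The function and parameter names are hypothetical; the input values are the Page2 figures reported in this article (depth 1, two pages at that depth, one reference, 97 absolute accesses).

```python
def coefficient(depth, pages_at_depth, references, c1=1.0, c2=1.0):
    """a = c1 * d + c2 * n / r (equation 3); c1 = c2 = 1 is the case examined."""
    return c1 * depth + c2 * pages_at_depth / references


def relative_accesses(absolute_accesses, depth, pages_at_depth, references):
    """RA = a * AA (equation 1)."""
    return coefficient(depth, pages_at_depth, references) * absolute_accesses


# Page2: d = 1, n = 2, r = 1, AA = 97
page2_a = coefficient(depth=1, pages_at_depth=2, references=1)   # 3.0
page2_ra = relative_accesses(97, depth=1, pages_at_depth=2, references=1)  # 291.0
```

The result, a = 3 and RA = 291, agrees with the Page2 values quoted from Table 1.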

First, from all PT and UT collected, we discarded
time values that were too short or too long. A
very short time represents the case of a user just
clicking through a link to the next page. A very long
time (one hour or more) represents the case of a user
suspending navigation through the site for an external
reason (an important e-mail or a meeting, for
example) and returning much later to continue.
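
A filter of this kind might look like the following Python sketch. The one-hour upper bound comes from the text; the two-second lower bound is purely an assumption for illustration, since the article does not say what counts as "too short". The function names are hypothetical.

```python
def filter_times(times_seconds, min_seconds=2, max_seconds=3600):
    """Drop time samples that are too short or one hour or longer.

    min_seconds is an assumed cutoff; max_seconds reflects the
    "one hour or more" rule.
    """
    return [t for t in times_seconds if min_seconds <= t < max_seconds]


def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


# Hypothetical page-time samples in seconds.
raw_page_times = [1, 27, 45, 3600, 7200]
kept = filter_times(raw_page_times)
mean_page_time = mean(kept)
```

For these sample values only 27 and 45 survive the filter, giving a mean page time of 36 seconds.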

The site we evaluated comprised 15 pages
(Page1, Page2, ..., Page15) organized into the
binary tree structure depicted in Figure 1a. (Our
algorithm also works on graph organizations; see
the sidebar, "Link Editing for Graph Sites.") Page1,
the initial home page, has two children, Page2 and
Page3. For every n from 1 through 7, Page n has two
children: Page 2n and Page (2n+1). There is only one
reference to each page (from its parent, with the
exception of the home page); therefore parameter
r has a value of 1 in every case. Our algorithm
maintains the binary structure in its revision,
depicted in Figure 1b. What changes is the position
of the pages within the structure. For instance, at
the end of the link-editing procedure, Page2 occupies
a position at a deeper level of the HTML tree
than it did initially.

Table 1 summarizes the accesses to
our system prior to link editing. For example,
Page2 has a mean page time of 27 seconds, 97
absolute accesses, and 291 relative accesses (RA =
a * AA, where a = d + n/r = 1 + 2/1 = 3).

As we can infer from Table 1, the initial HTML
structure is rather inapt. It undervalues pages such
as Page10, which exhibits significantly higher RA
than other pages located closer to the home page.
Table 2 summarizes the measurements based on the
revised binary tree.

As we can easily see from Table 2, the new tree
organization results in pages with a substantially
different AA. The new sum of AAs

SUM(AA) = AA(Page1) + AA(Page2) + ... + AA(Page15)

exhibits a significant increase: the old SUM(AA)
was 726; the new SUM(AA) is 866. The mean user
time, UT, has also increased, from 112 seconds to
146 seconds, whereas PT fluctuates. Furthermore,
the new page arrangement boosts the mean number
of pages per user (NP) from 2.04 to 2.67, an
increase similar to that for UT.

Even though these measurements are statistical,
we can draw significant inferences. First, even a simple
approach to link editing, based on relative page
accesses, can significantly affect a user's behavior
towards a particular Web site. Comparing Table 1
to Table 2, we can infer that we
increased the number of accesses to our site by 19
percent without changing the content of the pages.
The roughly 31 percent higher NP and UT indicate that the
new organization seems to be more convenient and
more attractive to potential users.
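
The percentages quoted here follow directly from the before-and-after figures reported in the text. A short Python check (the helper name is hypothetical):

```python
def percent_increase(old, new):
    """Relative increase from old to new, expressed in percent."""
    return (new - old) / old * 100


aa_gain = percent_increase(726, 866)    # sum of absolute accesses
ut_gain = percent_increase(112, 146)    # mean user time, in seconds
np_gain = percent_increase(2.04, 2.67)  # mean number of pages per session

print(round(aa_gain, 1), round(ut_gain, 1), round(np_gain, 1))
```

This gives approximately 19.3 percent for SUM(AA), 30.4 percent for UT, and 30.9 percent for NP.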

Another important aspect of this reorganization
is that it can be automated: using a simple link-editing
algorithm, we put the Web site structure in
a more balanced and organized state (from a
relative-access point of view).

APPLICATION RESULTS 

The platform requirements of the Soala software
are moderate. Any Pentium-based personal computer
with 16 Mbytes of RAM is sufficient for the
Soala program to run and calculate page popularities.
The Web server we used in our research was
the Apache 2.5.1 compiled for Solaris 2.6. Because
the algorithm is independent of the Web server
software, any Web server could serve the original
or modified Web pages.

We applied our algorithm to several Web sites
with totally different page structures (each created
on an ad hoc basis) and calculated the percentage of
[...]
which level, and so on) rather open ended.

We did use some logical restrictions for the initial
structure of Theleis, and we made some arbitrary
decisions about what the site should look like.
Then, we left the site to be accessed for 15 days.
Next, we used the Soala link-editing algorithm to
revise the initial HTML structure, and we retested
the accessibility for the same time period (15 days,
the same days of the week). As Table 4 shows, the
results were encouraging. There was a 14 percent
increase in absolute accesses. The 11 percent
increase in the average page time shows that the
site has become more attractive to users.

The links among pages consti- 
tute a "grafo" (a graph), and the 
access to the site is anonymous 
(no login name or password is 
required). Every page includes a 
link to the home page in addition to the alternative 
links a visitor might choose. Furthermore, to tempt 
the user into staying longer, we added some gift 
pages hidden in a deep level of the structure, toward 
the leaves. 

LIMITATIONS OF OUR APPROACH

There are cases in which an HTML rearrangement
is not feasible, because the structure of the Web
server follows the dictates of logical restrictions. For
example, take the case of a site used as a lyrics
archive. The initial page would be a table of contents
(A-Z, where A leads to all artists whose names
start with A, and so on); the leaves of the HTML
tree would be the songs of each artist. Even if a
particular song has substantially more accesses than the
others, it cannot become the root of the HTML
tree, because this would make navigation to the rest
of the lyrics unjustifiably problematic.

So Web page content is also important, and the
administrator should consider it in page rearrangement.
Soala 2.0 tackles such problems by enabling
the administrator to define stable nodes and stable
connections, as illustrated in Figure 2. If the
administrator defines a connection as stable, the
connected nodes cannot be interchanged. The
same applies to a stable node; it cannot be
interchanged with any of its parent or child nodes.

Figure 2. The link-editing interface for the Soala software. With a simple mouse
click, the user can define whether a node, or a link between two nodes, is stable.
The interface uses four icons to indicate the current state of each node and link. In
addition, the GUI gives the user a visual representation of the HTML structure.
Clicking on a specific node gives the user detailed information about and a
description of that node and its immediate relatives.
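
The stability rules can be expressed as a simple guard that a link-editing routine would consult before interchanging two pages. The Python sketch below is an assumption about how such a check could look, not a description of Soala's internals; the function name, the set-based representation, and the lyrics-archive page names are all hypothetical.

```python
def can_interchange(child, parent, stable_nodes, stable_links):
    """Return True if the page `child` may be swapped with its parent page.

    stable_nodes: set of pages that must keep their position
    stable_links: set of (parent, child) pairs whose connection is fixed
    """
    if child in stable_nodes or parent in stable_nodes:
        return False
    return (parent, child) not in stable_links


# Hypothetical lyrics archive: the table of contents must stay at the root,
# and one artist page must stay directly above one of its songs.
stable_nodes = {"contents"}
stable_links = {("artist-a1", "song-17")}

root_swap = can_interchange("letter-a", "contents", stable_nodes, stable_links)
song_swap = can_interchange("song-17", "artist-a1", stable_nodes, stable_links)
free_swap = can_interchange("artist-a1", "letter-a", stable_nodes, stable_links)
```

Only the last of the three swaps is permitted: the first involves a stable node and the second a stable connection.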

Another side effect of any link-editing algorithm
is that changing the positions of the HTML pages
can render links from external sites (such as search
engines, subject catalogs, and so on) out of date.
Such broken links can be annoying for Net surfers,
and the administrator should bear this effect in
mind, especially if the link-editing algorithm is
executed frequently and results in the replacement of
many useful pages.

D. Ingham and colleagues have proposed an
object-oriented approach to defeat broken links, 4
and F. Kappe has proposed a server update protocol
to propagate update information. 5 There is a
much easier method, however, that demands little
effort from the Web administrator. During the
link-editing procedure, the Soala software can map
the URLs of the pages being replaced and their new
URLs and put them into a database (either a
relational database or a simple file accessible from the
Web server). Then, the administrator can specially
configure the error handler to check this database
when a user pursues a nonupdated external link.
(See http://www.apache.org/docs-1.2/custom-error.html
for information on how to configure the
Apache Web server to handle such error conditions.)
Instead of returning the "404 Not found"
message, the error handler determines whether the
outdated link has been replaced by another one
during link editing. If this page exists in the database,
a message informs the user of its new URL
and automatically redirects the browser to point to
the new location of the page. If the file does not
exist, the browser displays an error message. In this
way, even if the link-editing algorithm has modified
all the Web server's pages and changed their
URLs, there will be no broken links, even on
references from search engines.
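
The lookup performed by such an error handler can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the mapping is held in a plain dictionary rather than a database, the redirect is issued as an HTTP 301 with a Location header (the article does not specify the redirect mechanism), and the function name and paths are hypothetical.

```python
def resolve_moved_page(requested_path, moved_pages):
    """Decide how to answer a request for a path that no longer exists.

    moved_pages maps old URLs to the new URLs recorded during link editing.
    Returns a (status, headers, body) triple.
    """
    new_path = moved_pages.get(requested_path)
    if new_path is None:
        # The page was never moved by link editing: a genuine missing page.
        return 404, {}, "Not found"
    # Assumption: a permanent redirect plus a short message for the user.
    return 301, {"Location": new_path}, f"This page has moved to {new_path}"


# Hypothetical mapping written out after one link-editing run.
moved_pages = {"/page10.html": "/page2.html", "/page2.html": "/page10.html"}

moved = resolve_moved_page("/page10.html", moved_pages)
missing = resolve_moved_page("/no-such-page.html", moved_pages)
```

A real handler would be wired to the server's custom error mechanism and would need to decide how to treat pages that have been moved more than once; this sketch covers only the single lookup.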

To reduce the additional load on the Web server
caused by execution of the error script handler,
we advise the administrator to resubmit the home
page URL to the search engine so it can update its
links. The update will be available to all network
users after the search engine's reorganization
interval, usually a month.

CONCLUSION 

An awkward arrangement of an HTML tree can
discourage a potential Internet user from visiting
and staying at a specific Web site. We have shown
how a link-editing algorithm can automatically fix a
poor organization by calculating each page's relative
popularity. Even a simple approach using relative
page popularity (the combination of our equations
1 and 2) leads to a substantially enhanced scheme.

In this article, we considered only cases where
the objective is to make it easier for a user to find
the requested data: the faster the access, the better
the organization of the Web server. But what if a
Web site contains commercial material
(advertisements and so on)? In this case, the objective may
be different: the best organization may be one that
achieves the highest AA for those pages with a
commercial interest. In these cases, we can associate a
weight factor w with each page and modify the
link-editing algorithm so that it promotes pages
with a higher w more vigorously. We have developed
a version of our link-editing algorithm to
handle this association. A beta version, still under
evaluation, has returned encouraging results.

Surely there are many different approaches to
rearranging links between pages and many different
ways of perceiving the notion of popularity. For
example, we could give coefficient a a more
complicated definition than that given in equation 2,
or we could rearrange the initial page structure
using a more sophisticated method than our
link-editing algorithm. Nevertheless, the approach we
presented in this article should not be underrated;
besides being easy to analyze, it has also proved
adequate in most cases, considering the percentage
increase in page accesses. A link-editing algorithm
that could add new links and discard others (in
addition to interchanging pages) would probably
demonstrate a better enhancement.

Our team is still in the process of identifying several
new variables that affect Web page popularity.
We will make our new results available as soon as
they are extracted.